pacman::p_load(ggplot2, ggiraph, plotly,
patchwork, DT, tidyverse,
ggrepel, ggthemes, hrbrthemes,
tidyverse) Take-home_Ex01
1. The task
The purpose of this take home exercise is to to reveal the demographic and financial characteristics of the city of Engagement, using appropriate static and interactive statistical graphics methods. This exercise requires a user-friendly and interactive solution that helps city managers and planners to explore the complex data in an engaging way and reveal hidden patterns.
2. Data preparation
2.1 Installing the data packages
The unzipped files have been saved into a new folder named data for better organization.
part_info <- read_csv("data/Participants.csv")Rows: 1011 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): educationLevel, interestGroup
dbl (4): participantId, householdSize, age, joviality
lgl (1): haveKids
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Participants.csv
Contains information about the residents of City of Engagement that have agreed to participate in this study.
- participantId (integer): unique ID assigned to each participant.
- householdSize (integer): the number of people in the participant’s household
- haveKids (boolean): whether there are children living in the participant’s household.
- age (integer): participant’s age in years at the start of the study.
- educationLevel (string factor): the participant’s education level, one of: {“Low”, “HighSchoolOrCollege”, “Bachelors”, “Graduate”}
- interestGroup (char): a char representing the participant’s stated primary interest group, one of {“A”, “B”, “C”, “D”, “E”, “F”, “G”, “H”, “I”, “J”}. Note: specific topics of interest have been redacted to avoid bias.
- joviality (float): a value ranging from [0,1] indicating the participant’s overall happiness level at the start of the study.
finance <- read_csv("data/FinancialJournal.csv")Rows: 1513636 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): category
dbl (2): participantId, amount
dttm (1): timestamp
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
FinancialJournal.csv
Contains information about financial transactions.
- participantId (integer): unique ID corresponding to the participant affected
- timestamp (datetime): the time when the check-in was logged
- category (string factor): a string describing the expense category, one of {“Education”, “Food”, “Recreation”, “RentAdjustment”, “Shelter”, “Wage”}
- amount (double): the amount of the transaction For explanation of Rent Adjustment, please refer to this link
Lets first examine the properties of the participants csv file.
head(part_info)# A tibble: 6 × 7
participantId householdSize haveKids age educationLevel interestGroup
<dbl> <dbl> <lgl> <dbl> <chr> <chr>
1 0 3 TRUE 36 HighSchoolOrCollege H
2 1 3 TRUE 25 HighSchoolOrCollege B
3 2 3 TRUE 35 HighSchoolOrCollege A
4 3 3 TRUE 21 HighSchoolOrCollege I
5 4 3 TRUE 43 Bachelors H
6 5 3 TRUE 32 HighSchoolOrCollege D
# ℹ 1 more variable: joviality <dbl>
str(part_info)spc_tbl_ [1,011 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ participantId : num [1:1011] 0 1 2 3 4 5 6 7 8 9 ...
$ householdSize : num [1:1011] 3 3 3 3 3 3 3 3 3 3 ...
$ haveKids : logi [1:1011] TRUE TRUE TRUE TRUE TRUE TRUE ...
$ age : num [1:1011] 36 25 35 21 43 32 26 27 20 35 ...
$ educationLevel: chr [1:1011] "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" "HighSchoolOrCollege" ...
$ interestGroup : chr [1:1011] "H" "B" "A" "I" ...
$ joviality : num [1:1011] 0.00163 0.32809 0.39347 0.13806 0.8574 ...
- attr(*, "spec")=
.. cols(
.. participantId = col_double(),
.. householdSize = col_double(),
.. haveKids = col_logical(),
.. age = col_double(),
.. educationLevel = col_character(),
.. interestGroup = col_character(),
.. joviality = col_double()
.. )
- attr(*, "problems")=<externalptr>
summary(part_info[, c("householdSize", "haveKids", "age", "educationLevel", "interestGroup", "joviality")]) householdSize haveKids age educationLevel
Min. :1.000 Mode :logical Min. :18.00 Length:1011
1st Qu.:1.000 FALSE:710 1st Qu.:29.00 Class :character
Median :2.000 TRUE :301 Median :39.00 Mode :character
Mean :1.964 Mean :39.07
3rd Qu.:3.000 3rd Qu.:50.00
Max. :3.000 Max. :60.00
interestGroup joviality
Length:1011 Min. :0.000204
Class :character 1st Qu.:0.240074
Mode :character Median :0.477539
Mean :0.493794
3rd Qu.:0.746819
Max. :0.999234
table(part_info$interestGroup)
A B C D E F G H I J
102 91 102 96 83 106 108 111 96 116
table(part_info$educationLevel)
Bachelors Graduate HighSchoolOrCollege Low
232 170 525 84
sum(is.na(part_info))[1] 0
Next we will visualize some variables in document to see if there are any anomalies. From the chart below, we see that there exist no participants with extremely large age or low
p1 <- ggplot(part_info, aes(x = age)) +
geom_bar() +
labs(title = "Distribution of Participants' Age",
x = "Age",
y = "No. of person")
fig <- ggplotly(p1)
fig <- fig %>%
layout(xaxis = list(title = 'Age'),
yaxis = list(title = 'No. of person'))
p2 <- ggplot(part_info, aes(x = householdSize)) +
geom_bar() +
labs(title = "Distribution of house hold size",
x = "Household size",
y = "Count")
fig2 <- ggplotly(p2)
fig2 <- fig2 %>%
layout(xaxis = list(title = 'Household size'),
yaxis = list(title = 'Count'))
fig3 <- subplot(fig, fig2, nrows = 1, titleY = TRUE, titleX = TRUE, margin = 0.1 ) %>%
layout(title = 'Outlier checking',
plot_bgcolor='#e5ecf6',
xaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff'),
yaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff')) %>%
layout(annotations = list(
list(
x = 0.2,
y = 1.0,
text = "Distribution of Participants' Age",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
),
list(
x = 0.8,
y = 1.0,
text = "Distribution of house hold size",
xref = "paper",
yref = "paper",
xanchor = "center",
yanchor = "bottom",
showarrow = FALSE
)
))
fig3Lets move on to examine the properties of the finance csv file.
Sneak peak of the first few entries in the dataset
head(finance)# A tibble: 6 × 4
participantId timestamp category amount
<dbl> <dttm> <chr> <dbl>
1 0 2022-03-01 00:00:00 Wage 2473.
2 0 2022-03-01 00:00:00 Shelter -555.
3 0 2022-03-01 00:00:00 Education -38.0
4 1 2022-03-01 00:00:00 Wage 2047.
5 1 2022-03-01 00:00:00 Shelter -555.
6 1 2022-03-01 00:00:00 Education -38.0
str(finance)spc_tbl_ [1,513,636 × 4] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ participantId: num [1:1513636] 0 0 0 1 1 1 2 2 2 3 ...
$ timestamp : POSIXct[1:1513636], format: "2022-03-01" "2022-03-01" ...
$ category : chr [1:1513636] "Wage" "Shelter" "Education" "Wage" ...
$ amount : num [1:1513636] 2473 -555 -38 2047 -555 ...
- attr(*, "spec")=
.. cols(
.. participantId = col_double(),
.. timestamp = col_datetime(format = ""),
.. category = col_character(),
.. amount = col_double()
.. )
- attr(*, "problems")=<externalptr>
summary(finance[c("amount")]) amount
Min. :-1562.726
1st Qu.: -5.594
Median : -4.000
Mean : 20.047
3rd Qu.: 21.598
Max. : 4096.526
table(finance$category)
Education Food Recreation RentAdjustment Shelter
3319 790051 296013 131 11463
Wage
412659
sum(is.na(finance))[1] 0
Check for outlier in the amount variable. We first group the amount variables by the category. Then we do a box plot. From the chart we can observe that shelter has some abnormally small values to the negative end and wages has some exceptionally large values on the positive end. We may wish to take note of these in our analysis.
# Create a box plot of amount by category
ggplotly(ggplot(finance, aes(x = category, y = amount, fill = category)) +
geom_boxplot() +
xlab("Expense Category") +
ylab("Amount") +
ggtitle("Amount Spent by Expense Category"))